Python for data analysis : Final project

Subject : Estimation of obesity levels based on eating habits and physical condition

Team members : Louis MARTYR, Killian LAFAYE, Marc LEMAISTRE

I. Introduction

Problematic : Can we give a realistic estimation of the obesity level of someone purely based on their eating habits and physical condition (without their weight)

For this, we will use this dataset. It represents data collected online.

We found some limits about this dataset that you have to keep in mind while seing the results:

  • 77% of the data were generated artifically with the Weka Tool, which is tremendous (only 485 real input out of 2111) with a hugh representativity of normal BMI cases
  • The method used for oversampling is not meant to be used in the case of the study. Indeed, there isn't enough data for extreme categories. We will see the consequences later on this notebook
  • These data were collected in Colombia, Peru and Mexico (see). As it was not representative they generated data so the BMI would fit to the national mexican one. However the mexican BMI is not the colombian BMI.
  • </ul> </p> To sum up, we will try to cluster our dataset using all of the important features we have at hand except for the weight values, and we will see if our fears concerning the artificial data were justified or not.

II. Data Preprocessing

Let's start by importing every library we will need throughout our notebook
In [2]:
import pandas as pd
import numpy as np
import datetime
import scipy as sc
import random

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler

from scipy.spatial.distance import pdist, squareform

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
First let's actually import the dataset, and quickly glance at what it looks like
In [3]:
df = pd.read_csv("ObesityDataSet.csv", sep=',')
df.head()
Out[3]:
Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS NObeyesdad
0 Female 21.0 1.62 64.0 yes no 2.0 3.0 Sometimes no 2.0 no 0.0 1.0 no Public_Transportation Normal_Weight
1 Female 21.0 1.52 56.0 yes no 3.0 3.0 Sometimes yes 3.0 yes 3.0 0.0 Sometimes Public_Transportation Normal_Weight
2 Male 23.0 1.80 77.0 yes no 2.0 3.0 Sometimes no 2.0 no 2.0 1.0 Frequently Public_Transportation Normal_Weight
3 Male 27.0 1.80 87.0 no no 3.0 3.0 Sometimes no 2.0 no 2.0 0.0 Frequently Walking Overweight_Level_I
4 Male 22.0 1.78 89.8 no no 2.0 1.0 Sometimes no 2.0 no 0.0 0.0 Sometimes Public_Transportation Overweight_Level_II
In [4]:
df.columns
Out[4]:
Index(['Gender', 'Age', 'Height', 'Weight', 'family_history_with_overweight',
       'FAVC', 'FCVC', 'NCP', 'CAEC', 'SMOKE', 'CH2O', 'SCC', 'FAF', 'TUE',
       'CALC', 'MTRANS', 'NObeyesdad'],
      dtype='object')
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2111 non-null   object 
 1   Age                             2111 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object 
 12  FAF                             2111 non-null   float64
 13  TUE                             2111 non-null   float64
 14  CALC                            2111 non-null   object 
 15  MTRANS                          2111 non-null   object 
 16  NObeyesdad                      2111 non-null   object 
dtypes: float64(8), object(9)
memory usage: 280.5+ KB

We can see that all of the columns do not have a single missing value, which is good news as it shows that the dataset is already relatively clean.

In [6]:
df.shape
Out[6]:
(2111, 17)
In [7]:
df.describe()
Out[7]:
Age Height Weight FCVC NCP CH2O FAF TUE
count 2111.000000 2111.000000 2111.000000 2111.000000 2111.000000 2111.000000 2111.000000 2111.000000
mean 24.312600 1.701677 86.586058 2.419043 2.685628 2.008011 1.010298 0.657866
std 6.345968 0.093305 26.191172 0.533927 0.778039 0.612953 0.850592 0.608927
min 14.000000 1.450000 39.000000 1.000000 1.000000 1.000000 0.000000 0.000000
25% 19.947192 1.630000 65.473343 2.000000 2.658738 1.584812 0.124505 0.000000
50% 22.777890 1.700499 83.000000 2.385502 3.000000 2.000000 1.000000 0.625350
75% 26.000000 1.768464 107.430682 3.000000 3.000000 2.477420 1.666678 1.000000
max 61.000000 1.980000 173.000000 3.000000 4.000000 3.000000 3.000000 2.000000
In [8]:
display(df[df.duplicated()])
Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS NObeyesdad
98 Female 21.0 1.52 42.0 no no 3.0 1.0 Frequently no 1.0 no 0.0 0.0 Sometimes Public_Transportation Insufficient_Weight
106 Female 25.0 1.57 55.0 no yes 2.0 1.0 Sometimes no 2.0 no 2.0 0.0 Sometimes Public_Transportation Normal_Weight
174 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I
179 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I
184 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I
209 Female 22.0 1.69 65.0 yes yes 2.0 3.0 Sometimes no 2.0 no 1.0 1.0 Sometimes Public_Transportation Normal_Weight
309 Female 16.0 1.66 58.0 no no 2.0 1.0 Sometimes no 1.0 no 0.0 1.0 no Walking Normal_Weight
460 Female 18.0 1.62 55.0 yes yes 2.0 3.0 Frequently no 1.0 no 1.0 1.0 no Public_Transportation Normal_Weight
467 Male 22.0 1.74 75.0 yes yes 3.0 3.0 Frequently no 1.0 no 1.0 0.0 no Automobile Normal_Weight
496 Male 18.0 1.72 53.0 yes yes 2.0 3.0 Sometimes no 2.0 no 0.0 2.0 Sometimes Public_Transportation Insufficient_Weight
527 Female 21.0 1.52 42.0 no yes 3.0 1.0 Frequently no 1.0 no 0.0 0.0 Sometimes Public_Transportation Insufficient_Weight
659 Female 21.0 1.52 42.0 no yes 3.0 1.0 Frequently no 1.0 no 0.0 0.0 Sometimes Public_Transportation Insufficient_Weight
663 Female 21.0 1.52 42.0 no yes 3.0 1.0 Frequently no 1.0 no 0.0 0.0 Sometimes Public_Transportation Insufficient_Weight
763 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I
764 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I
824 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I
830 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I
831 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I
832 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I
833 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I
834 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I
921 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I
922 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I
923 Male 21.0 1.62 70.0 no yes 2.0 1.0 no no 3.0 no 1.0 0.0 Sometimes Public_Transportation Overweight_Level_I

We can see that there are only 24 duplicated values out of 2111 originally, so we believe that we can afford to remove them, even though they MIGHT actually represent different people that happen to have the exact same characteristics.

Having the same individual multiple times will affect badly our prediction due to the algorithm we'll use, so we'll remove them.

Also, we said earlier that in this dataset, 77% of the data has been artificially implemented, so it is quite possible that some, if not all of these duplicated data correspond to artificial data that copied the values of another row.

In [9]:
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2087 entries, 0 to 2086
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2087 non-null   object 
 1   Age                             2087 non-null   float64
 2   Height                          2087 non-null   float64
 3   Weight                          2087 non-null   float64
 4   family_history_with_overweight  2087 non-null   object 
 5   FAVC                            2087 non-null   object 
 6   FCVC                            2087 non-null   float64
 7   NCP                             2087 non-null   float64
 8   CAEC                            2087 non-null   object 
 9   SMOKE                           2087 non-null   object 
 10  CH2O                            2087 non-null   float64
 11  SCC                             2087 non-null   object 
 12  FAF                             2087 non-null   float64
 13  TUE                             2087 non-null   float64
 14  CALC                            2087 non-null   object 
 15  MTRANS                          2087 non-null   object 
 16  NObeyesdad                      2087 non-null   object 
dtypes: float64(8), object(9)
memory usage: 277.3+ KB
Now let's start tweaking some of the values and labels of our dataset to make everything cleaner

First, let's rename the different columns so that all of them are more explicit towards what they represent.

For this I'm using the variables info found on https://www.sciencedirect.com/science/article/pii/S2352340919306985?via%3Dihub

In [10]:
df.columns = ['Gender', 'Age', 'Height', 'Weight', 'Family_history_with_overweight', 'Frequency_eat_high_caloric_food', 'Frequency_eat_vegetables', 'Number_of_main_meals',
              'Frequency_eat_between_meals', 'Smoke', 'Frequency_water', 'Monitoring_calories_consumption', 'Frequency_physical_activity', 'Time_using_technology_devices',
              'Frequency_alcohol', 'Main_transport', 'Obesity_level_category']
df.head()
Out[10]:
Gender Age Height Weight Family_history_with_overweight Frequency_eat_high_caloric_food Frequency_eat_vegetables Number_of_main_meals Frequency_eat_between_meals Smoke Frequency_water Monitoring_calories_consumption Frequency_physical_activity Time_using_technology_devices Frequency_alcohol Main_transport Obesity_level_category
0 Female 21.0 1.62 64.0 yes no 2.0 3.0 Sometimes no 2.0 no 0.0 1.0 no Public_Transportation Normal_Weight
1 Female 21.0 1.52 56.0 yes no 3.0 3.0 Sometimes yes 3.0 yes 3.0 0.0 Sometimes Public_Transportation Normal_Weight
2 Male 23.0 1.80 77.0 yes no 2.0 3.0 Sometimes no 2.0 no 2.0 1.0 Frequently Public_Transportation Normal_Weight
3 Male 27.0 1.80 87.0 no no 3.0 3.0 Sometimes no 2.0 no 2.0 0.0 Frequently Walking Overweight_Level_I
4 Male 22.0 1.78 89.8 no no 2.0 1.0 Sometimes no 2.0 no 0.0 0.0 Sometimes Public_Transportation Overweight_Level_II

We can see that some of the columns seem to have qualitative data.

For complexity and storage purposes, we will replace these values by integer values, but first let's check which columns apply to this transformation we want to do :

In [11]:
for col in df.columns:
  if df[col].dtype=='object':
    print(f"{col} : {df[col].unique()}")
Gender : ['Female' 'Male']
Family_history_with_overweight : ['yes' 'no']
Frequency_eat_high_caloric_food : ['no' 'yes']
Frequency_eat_between_meals : ['Sometimes' 'Frequently' 'Always' 'no']
Smoke : ['no' 'yes']
Monitoring_calories_consumption : ['no' 'yes']
Frequency_alcohol : ['no' 'Sometimes' 'Frequently' 'Always']
Main_transport : ['Public_Transportation' 'Walking' 'Automobile' 'Motorbike' 'Bike']
Obesity_level_category : ['Normal_Weight' 'Overweight_Level_I' 'Overweight_Level_II'
 'Obesity_Type_I' 'Insufficient_Weight' 'Obesity_Type_II'
 'Obesity_Type_III']

Now that we know which column to take care of, let's change the values for all of the object type columns, and store their decoded labels in a dictionnary

In [12]:
df['Age'] = df['Age'].astype('uint8')
df['Height'] = df['Height'].round(2)
df['Weight'] = df['Weight'].round(1)
df.shape
Out[12]:
(2087, 17)

We want to make the data AI friendly. For that we have to numerize them. We'll use one hot encoder for the main transport category because we think that there is no mathematical relation between the different kinds of transport.

For example we didn't manage to prove that "Walking = Public transportation*2". One hot encoding creates a different column for each value of the category and assign either 1 or 0 in it.

For the rest of the columns we used the LabelEncoder, which makes our dataset more easily readable by keeping the different values possible in the same column.

In [13]:
#Using OneHotEncoder for the Main_transport column
encoder = OneHotEncoder()
encoder.fit(pd.DataFrame(df['Main_transport']))
encoded_data = encoder.transform(pd.DataFrame(df['Main_transport']))
dense_encoded_data = encoded_data.toarray()
cols = [f'Main_transport_{col}' for col in encoder.categories_[0]]
encoded_df = pd.DataFrame(dense_encoded_data, columns=cols)
encoded_df = encoded_df.astype('uint8')
df = pd.concat([df.drop(columns=['Main_transport']), encoded_df], axis=1)

#Using LabelEncoder for the rest of the object type columns
encoder = LabelEncoder()
decoded_dict = {}
for col in df.columns:
  if df[col].dtype=='object':
    encoded_labels = encoder.fit_transform(df[col])
    df[col] = encoded_labels
    decoded_dict[col] = {label: value for label, value in zip(encoded_labels, encoder.inverse_transform(encoded_labels))}
    df[col] = df[col].astype('uint8')

decoded_dict['Age'] = 'Numeric values'
decoded_dict['Height'] = 'Numeric values'
decoded_dict['Weight'] = 'Numeric values'
decoded_dict['MBI'] = 'Numeric values'
decoded_dict['Frequency_eat_vegetables'] = {1: 'Never', 2: 'Sometimes', 3: 'Always'}
decoded_dict['Number_of_main_meals'] = {1: '1/day', 2: '2/day', 3: '3/day', 4: '3+/day'}
decoded_dict['Frequency_water'] = {1: 'Less than 1L/day', 2: 'Between 1 and 2L/day', 3: 'More than 2L/day'}
decoded_dict['Frequency_physical_activity'] = {0: 'No activity', 1: '1 or 2 days/week', 2: '2 or 4 days/week', 3: '4+ days/week'}
decoded_dict['Time_using_technology_devices'] = {0: '0-2 hours/day', 1: '3-5 hours/day', 2: '5+ hours/day'}
decoded_dict['Main_transport_Automobile'] = {0: 'no', 1: 'yes'}
decoded_dict['Main_transport_Bike'] = {0: 'no', 1: 'yes'}
decoded_dict['Main_transport_Motorbike'] = {0: 'no', 1: 'yes'}
decoded_dict['Main_transport_Public_Transportation'] = {0: 'no', 1: 'yes'}
decoded_dict['Main_transport_Walking'] = {0: 'no', 1: 'yes'}

df.head()
Out[13]:
Gender Age Height Weight Family_history_with_overweight Frequency_eat_high_caloric_food Frequency_eat_vegetables Number_of_main_meals Frequency_eat_between_meals Smoke ... Monitoring_calories_consumption Frequency_physical_activity Time_using_technology_devices Frequency_alcohol Obesity_level_category Main_transport_Automobile Main_transport_Bike Main_transport_Motorbike Main_transport_Public_Transportation Main_transport_Walking
0 0 21 1.62 64.0 1 0 2.0 3.0 2 0 ... 0 0.0 1.0 3 1 0 0 0 1 0
1 0 21 1.52 56.0 1 0 3.0 3.0 2 1 ... 1 3.0 0.0 2 1 0 0 0 1 0
2 1 23 1.80 77.0 1 0 2.0 3.0 2 0 ... 0 2.0 1.0 1 1 0 0 0 1 0
3 1 27 1.80 87.0 0 0 3.0 3.0 2 0 ... 0 2.0 0.0 1 5 0 0 0 0 1
4 1 22 1.78 89.8 0 0 2.0 1.0 2 0 ... 0 0.0 0.0 2 6 0 0 0 1 0

5 rows × 21 columns

In [14]:
for col in decoded_dict:
  print(f"{col} : {decoded_dict[col]}")
Gender : {0: 'Female', 1: 'Male'}
Family_history_with_overweight : {1: 'yes', 0: 'no'}
Frequency_eat_high_caloric_food : {0: 'no', 1: 'yes'}
Frequency_eat_between_meals : {2: 'Sometimes', 1: 'Frequently', 0: 'Always', 3: 'no'}
Smoke : {0: 'no', 1: 'yes'}
Monitoring_calories_consumption : {0: 'no', 1: 'yes'}
Frequency_alcohol : {3: 'no', 2: 'Sometimes', 1: 'Frequently', 0: 'Always'}
Obesity_level_category : {1: 'Normal_Weight', 5: 'Overweight_Level_I', 6: 'Overweight_Level_II', 2: 'Obesity_Type_I', 0: 'Insufficient_Weight', 3: 'Obesity_Type_II', 4: 'Obesity_Type_III'}
Age : Numeric values
Height : Numeric values
Weight : Numeric values
MBI : Numeric values
Frequency_eat_vegetables : {1: 'Never', 2: 'Sometimes', 3: 'Always'}
Number_of_main_meals : {1: '1/day', 2: '2/day', 3: '3/day', 4: '3+/day'}
Frequency_water : {1: 'Less than 1L/day', 2: 'Between 1 and 2L/day', 3: 'More than 2L/day'}
Frequency_physical_activity : {0: 'No activity', 1: '1 or 2 days/week', 2: '2 or 4 days/week', 3: '4+ days/week'}
Time_using_technology_devices : {0: '0-2 hours/day', 1: '3-5 hours/day', 2: '5+ hours/day'}
Main_transport_Automobile : {0: 'no', 1: 'yes'}
Main_transport_Bike : {0: 'no', 1: 'yes'}
Main_transport_Motorbike : {0: 'no', 1: 'yes'}
Main_transport_Public_Transportation : {0: 'no', 1: 'yes'}
Main_transport_Walking : {0: 'no', 1: 'yes'}

For reference, the 'Obesity_level_category' value is based on the Mass Body Index formula.

MBI = weight / (height**2)

Here are the classifications :

  • Insufficient_Weight Less than 18.5
  • Normal_Weight 18.5 to 24.9
  • Overweight 25.0 to 29.9
  • Obesity_Type_I 30.0 to 34.9
  • Obesity_Type_II 35.0 to 39.9
  • Obesity_Type_III Higher than 40
In [15]:
df['MBI'] = df['Weight'] / df['Height']**2 #MBI for mass body index

for encoded_keys, category in decoded_dict['Obesity_level_category'].items():
  print(f"{category} min = {min(df[df['Obesity_level_category']==encoded_keys]['MBI'])}")
  print(f"{category} max = {max(df[df['Obesity_level_category']==encoded_keys]['MBI'])}")
Normal_Weight min = 18.49112426035503
Normal_Weight max = 24.913494809688583
Overweight_Level_I min = 22.851196670135273
Overweight_Level_I max = 28.70260272162769
Overweight_Level_II min = 25.695042931469242
Overweight_Level_II max = 30.38501560874089
Obesity_Type_I min = 29.832550467926424
Obesity_Type_I max = 35.265306122448976
Insufficient_Weight min = 12.989339783212396
Insufficient_Weight max = 19.09648547509741
Obesity_Type_II min = 33.90162649421909
Obesity_Type_II max = 39.81327880868793
Obesity_Type_III min = 36.62336752914814
Obesity_Type_III max = 50.82027465000501

Based on these numbers we can point out a few things :

  • First off, through all of the research that we have done, we have always found that there are no differenciations between Overweight_Level_I and Overweight_Level_II, so we can replace the both of them by simply Overweight
  • Then, we can see that some of the extremum values do not belong exactly to their respective categories, therefore we can correct their labels manually to fit the norms.
In [16]:
decoded_dict['Obesity_level_category']
Out[16]:
{1: 'Normal_Weight',
 5: 'Overweight_Level_I',
 6: 'Overweight_Level_II',
 2: 'Obesity_Type_I',
 0: 'Insufficient_Weight',
 3: 'Obesity_Type_II',
 4: 'Obesity_Type_III'}
In [17]:
def GetCategory(mbi):
  decoded_dict['Corrected_obesity_level_category'] = {0: 'Insufficient_Weight', 1: 'Normal_Weight', 2: 'Overweight', 3: 'Obesity_Type_I', 4: 'Obesity_Type_II', 5: 'Obesity_Type_III'}
  if mbi < 18.5: return 0
  elif mbi < 25: return 1
  elif mbi < 30: return 2
  elif mbi < 35: return 3
  elif mbi < 40: return 4
  else: return 5

df['Corrected_obesity_level_category'] = df['MBI'].apply(GetCategory).astype('uint8')
df.head()
Out[17]:
Gender Age Height Weight Family_history_with_overweight Frequency_eat_high_caloric_food Frequency_eat_vegetables Number_of_main_meals Frequency_eat_between_meals Smoke ... Time_using_technology_devices Frequency_alcohol Obesity_level_category Main_transport_Automobile Main_transport_Bike Main_transport_Motorbike Main_transport_Public_Transportation Main_transport_Walking MBI Corrected_obesity_level_category
0 0 21 1.62 64.0 1 0 2.0 3.0 2 0 ... 1.0 3 1 0 0 0 1 0 24.386526 1
1 0 21 1.52 56.0 1 0 3.0 3.0 2 1 ... 0.0 2 1 0 0 0 1 0 24.238227 1
2 1 23 1.80 77.0 1 0 2.0 3.0 2 0 ... 1.0 1 1 0 0 0 1 0 23.765432 1
3 1 27 1.80 87.0 0 0 3.0 3.0 2 0 ... 0.0 1 5 0 0 0 0 1 26.851852 2
4 1 22 1.78 89.8 0 0 2.0 1.0 2 0 ... 0.0 2 6 0 0 0 1 0 28.342381 2

5 rows × 23 columns

Let's check if everything is correct

In [18]:
for encoded_keys, category in decoded_dict['Corrected_obesity_level_category'].items():
  print(f"{category} min = {min(df[df['Corrected_obesity_level_category']==encoded_keys]['MBI'])}")
  print(f"{category} max = {max(df[df['Corrected_obesity_level_category']==encoded_keys]['MBI'])}")
Insufficient_Weight min = 12.989339783212396
Insufficient_Weight max = 18.49649887699828
Normal_Weight min = 18.51851851851852
Normal_Weight max = 24.990871840642622
Overweight min = 25.038566512400614
Overweight max = 29.975322978661634
Obesity_Type_I min = 30.10165737114829
Obesity_Type_I max = 34.97164461247637
Obesity_Type_II min = 35.00319921713275
Obesity_Type_II max = 39.99129046305705
Obesity_Type_III min = 40.009144947416544
Obesity_Type_III max = 50.82027465000501

Let's do a recap on our new dataset

In [19]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2087 entries, 0 to 2086
Data columns (total 23 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Gender                                2087 non-null   uint8  
 1   Age                                   2087 non-null   uint8  
 2   Height                                2087 non-null   float64
 3   Weight                                2087 non-null   float64
 4   Family_history_with_overweight        2087 non-null   uint8  
 5   Frequency_eat_high_caloric_food       2087 non-null   uint8  
 6   Frequency_eat_vegetables              2087 non-null   float64
 7   Number_of_main_meals                  2087 non-null   float64
 8   Frequency_eat_between_meals           2087 non-null   uint8  
 9   Smoke                                 2087 non-null   uint8  
 10  Frequency_water                       2087 non-null   float64
 11  Monitoring_calories_consumption       2087 non-null   uint8  
 12  Frequency_physical_activity           2087 non-null   float64
 13  Time_using_technology_devices         2087 non-null   float64
 14  Frequency_alcohol                     2087 non-null   uint8  
 15  Obesity_level_category                2087 non-null   uint8  
 16  Main_transport_Automobile             2087 non-null   uint8  
 17  Main_transport_Bike                   2087 non-null   uint8  
 18  Main_transport_Motorbike              2087 non-null   uint8  
 19  Main_transport_Public_Transportation  2087 non-null   uint8  
 20  Main_transport_Walking                2087 non-null   uint8  
 21  MBI                                   2087 non-null   float64
 22  Corrected_obesity_level_category      2087 non-null   uint8  
dtypes: float64(8), uint8(15)
memory usage: 161.1 KB
In [20]:
df.isnull().sum()
Out[20]:
Gender                                  0
Age                                     0
Height                                  0
Weight                                  0
Family_history_with_overweight          0
Frequency_eat_high_caloric_food         0
Frequency_eat_vegetables                0
Number_of_main_meals                    0
Frequency_eat_between_meals             0
Smoke                                   0
Frequency_water                         0
Monitoring_calories_consumption         0
Frequency_physical_activity             0
Time_using_technology_devices           0
Frequency_alcohol                       0
Obesity_level_category                  0
Main_transport_Automobile               0
Main_transport_Bike                     0
Main_transport_Motorbike                0
Main_transport_Public_Transportation    0
Main_transport_Walking                  0
MBI                                     0
Corrected_obesity_level_category        0
dtype: int64
In [21]:
df.describe(include='all')
Out[21]:
Gender Age Height Weight Family_history_with_overweight Frequency_eat_high_caloric_food Frequency_eat_vegetables Number_of_main_meals Frequency_eat_between_meals Smoke ... Time_using_technology_devices Frequency_alcohol Obesity_level_category Main_transport_Automobile Main_transport_Bike Main_transport_Motorbike Main_transport_Public_Transportation Main_transport_Walking MBI Corrected_obesity_level_category
count 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 ... 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000
mean 0.504073 24.009104 1.702616 86.858649 0.825108 0.883565 2.421466 2.701179 1.853857 0.021083 ... 0.663035 2.270244 3.014375 0.218495 0.003354 0.005271 0.746526 0.026354 29.768852 2.484907
std 0.500103 6.333297 0.093251 26.190841 0.379966 0.320823 0.534737 0.764614 0.459494 0.143695 ... 0.608153 0.517008 1.948470 0.413324 0.057831 0.072426 0.435104 0.160223 8.029520 1.543128
min 0.000000 14.000000 1.450000 39.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 12.989340 0.000000
25% 0.000000 19.000000 1.630000 66.000000 1.000000 1.000000 2.000000 2.697467 2.000000 0.000000 ... 0.000000 2.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 24.368897 1.000000
50% 1.000000 22.000000 1.700000 83.100000 1.000000 1.000000 2.396265 3.000000 2.000000 0.000000 ... 0.630866 2.000000 3.000000 0.000000 0.000000 0.000000 1.000000 0.000000 28.864427 2.000000
75% 1.000000 26.000000 1.770000 108.000000 1.000000 1.000000 3.000000 3.000000 2.000000 0.000000 ... 1.000000 3.000000 5.000000 0.000000 0.000000 0.000000 1.000000 0.000000 36.068560 4.000000
max 1.000000 61.000000 1.980000 173.000000 1.000000 1.000000 3.000000 4.000000 3.000000 1.000000 ... 2.000000 3.000000 6.000000 1.000000 1.000000 1.000000 1.000000 1.000000 50.820275 5.000000

8 rows × 23 columns

# Ce texte est au format code

III. Data Visualization & Analysis

We're starting off by writing a method that we will use later :

autolabel is used for the barplots to display the value of each bar on top of it to facilitate the lecture.

In [22]:
def autolabel(bars, myax=None):
    if myax==None:
      for bar in bars:
          height = bar.get_height()
          ax.annotate('{}'.format(height),
                      xy=(bar.get_x() + bar.get_width() / 2, height),
                      xytext=(0, 3),
                      textcoords="offset points",
                      ha='center', va='bottom')
    else:
      for bar in bars:
          height = bar.get_height()
          myax.annotate('{}'.format(height),
                      xy=(bar.get_x() + bar.get_width() / 2, height),
                      xytext=(0, 3),
                      textcoords="offset points",
                      ha='center', va='bottom')

Now for the visualisation

In [23]:
obesity_types_df = df['Corrected_obesity_level_category'].value_counts()
lecture_order = list(range(len(decoded_dict['Corrected_obesity_level_category'])))
obesity_types_df = obesity_types_df.reindex(lecture_order, fill_value=0)

n_types = len(lecture_order)
label = [name for key,name in decoded_dict['Corrected_obesity_level_category'].items()]
cmap = plt.get_cmap('RdYlGn')
color = cmap(np.linspace(0, 1, len(label))[::-1])

fig, ax = plt.subplots(figsize=(8, 6))

ax.pie(obesity_types_df, labels=label, colors=color, autopct='%1.1f%%',explode=[0.005]*6)
plt.title('Obesity types representation in our dataset')
plt.tight_layout()
plt.show()

The majority of people are at least overweight. This is huge, given that it increases the risk of cancer, diabetes, heart disease and muscular and skeletal ageing.

This first graph shows how our study is important and proves that the problem is real.

In [24]:
obesity_types_df_men = df[df['Gender']==1]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_women = df[df['Gender']==0]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
lecture_order = list(range(len(decoded_dict['Corrected_obesity_level_category'])))

n_types = len(lecture_order)
index = np.arange(n_types)
bar_width = 0.35

fig, ax = plt.subplots(figsize=(9, 6))
bar_men = ax.bar(index, obesity_types_df_men, bar_width, color='dodgerblue', label='Men')
bar_women = ax.bar(index + bar_width, obesity_types_df_women, bar_width, color='violet', label='Women')
autolabel(bar_men)
autolabel(bar_women)

ax.set_title('Obesity types representation in our dataset by gender')
ax.set_xlabel('Obesity types')
ax.set_ylabel('Number of people from that category')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax.legend()
fig.tight_layout()
plt.show()

There seems to be a correlation between gender and obesity (shown later with the correlation matrix). We'll keep this feature.

We also see that there is only one male that belongs in the Obesity_Type_III level in this dataset, we'll see if this is explained by a huge correlation between the gender and obesity showing that women tend to have more critical obese conditions than men, or if it is simply a bias we'll have to consider that our dataset simply does not have many men that belong to that category.

In [25]:
obesity_types_df_young = df[df['Age']<20]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_adult = df[(df['Age']>=20) & (df['Age']<30)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_senior = df[(df['Age']>=30) & (df['Age']<40)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_old = df[df['Age']>40]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
lecture_order = list(range(len(decoded_dict['Corrected_obesity_level_category'])))

n_types = len(lecture_order)
index = np.arange(n_types)
bar_width = 0.2

fig, ax = plt.subplots(figsize=(9, 6))
bar_young = ax.bar(index, obesity_types_df_young, bar_width, color='DarkBlue', label='0-19')
bar_adult = ax.bar(index + bar_width, obesity_types_df_adult, bar_width, color='royalblue', label='20-29')
bar_senior = ax.bar(index + 2*bar_width, obesity_types_df_senior, bar_width, color='DodgerBlue', label='30-39')
bar_old = ax.bar(index + 3*bar_width, obesity_types_df_old, bar_width, color='LightSkyBlue', label='40+')
autolabel(bar_young)
autolabel(bar_adult)
autolabel(bar_senior)
autolabel(bar_old)

ax.set_title('Obesity types representation in our dataset by age')
ax.set_xlabel('Obesity types')
ax.set_ylabel('Number of people from that category')
ax.set_xticks(index + bar_width * 1.5)
ax.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax.legend()
fig.tight_layout()
plt.show()

It's very difficult to be obese when you're young, due to morphological phenomena. However, there are still young people who are obese. This is confirmed by the fact that the number of obese people is increasing drastically as the population ages.

In [26]:
obesity_types_df_young_men = df[(df['Age']<20) & (df['Gender']==1)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_adult_men = df[(df['Age']>=20) & (df['Age']<30) & (df['Gender']==1)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_senior_men = df[(df['Age']>=30) & (df['Age']<40) & (df['Gender']==1)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_old_men = df[(df['Age']>40) & (df['Gender']==1)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_young_women = df[(df['Age']<20) & (df['Gender']==0)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_adult_women = df[(df['Age']>=20) & (df['Age']<30) & (df['Gender']==0)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_senior_women = df[(df['Age']>=30) & (df['Age']<40) & (df['Gender']==0)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_old_women = df[(df['Age']>40) & (df['Gender']==0)]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
lecture_order = list(range(len(decoded_dict['Corrected_obesity_level_category'])))

n_types = len(lecture_order)
index = np.arange(n_types)
bar_width = 0.2

fig = plt.figure(figsize=(16, 8))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

bar_young_men = ax1.bar(index, obesity_types_df_young_men, bar_width, color='aquamarine', label='0-19')
bar_adult_men = ax1.bar(index + bar_width, obesity_types_df_adult_men, bar_width, color='deepskyblue', label='20-29')
bar_senior_men = ax1.bar(index + 2*bar_width, obesity_types_df_senior_men, bar_width, color='royalblue', label='30-39')
bar_old_men = ax1.bar(index + 3*bar_width, obesity_types_df_old_men, bar_width, color='navy', label='40+')
autolabel(bar_young_men, ax1)
autolabel(bar_adult_men, ax1)
autolabel(bar_senior_men, ax1)
autolabel(bar_old_men, ax1)

ax1.set_title('Obesity types representation in our dataset by age for men')
ax1.set_xlabel('Obesity types')
ax1.set_ylabel('Number of people from that category')
ax1.set_xticks(index + bar_width * 1.5)
ax1.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax1.legend()

bar_young_women = ax2.bar(index, obesity_types_df_young_women, bar_width, color='lightpink', label='0-19')
bar_adult_women = ax2.bar(index + bar_width, obesity_types_df_adult_women, bar_width, color='magenta', label='20-29')
bar_senior_women = ax2.bar(index + 2*bar_width, obesity_types_df_senior_women, bar_width, color='crimson', label='30-39')
bar_old_women = ax2.bar(index + 3*bar_width, obesity_types_df_old_women, bar_width, color='maroon', label='40+')
autolabel(bar_young_women, ax2)
autolabel(bar_adult_women, ax2)
autolabel(bar_senior_women, ax2)
autolabel(bar_old_women, ax2)

ax2.set_title('Obesity types representation in our dataset by age for women')
ax2.set_xlabel('Obesity types')
ax2.set_ylabel('Number of people from that category')
ax2.set_xticks(index + bar_width * 1.5)
ax2.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax2.legend()

fig.tight_layout()
plt.show()

These graphs confirm what we have seen and concluded earlier.

Indeed, the majority of the people contained in this dataset being in their 20s, and the majority of them being at least Overweight for men and in Obesity_Type_III for women.

This confirms the importance of our subject for the upcoming generations.

In [27]:
obesity_types_df_family = df[df['Family_history_with_overweight']==1]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_no_family = df[df['Family_history_with_overweight']==0]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_eat_hcf = df[df['Frequency_eat_high_caloric_food']==1]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_no_eat_hcf = df[df['Frequency_eat_high_caloric_food']==0]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_eat_once_between_meals = df[df['Frequency_eat_between_meals']==0]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_eat_twice_between_meals = df[df['Frequency_eat_between_meals']==1]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_eat_thrice_between_meals = df[df['Frequency_eat_between_meals']==2]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_eat_more_between_meals = df[df['Frequency_eat_between_meals']==3]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_never_eat_vegetables = df[df['Frequency_eat_vegetables']<2]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_often_eat_vegetables = df[df['Frequency_eat_vegetables']<3]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
obesity_types_df_always_eat_vegetables = df[df['Frequency_eat_vegetables']==3]['Corrected_obesity_level_category'].value_counts().reindex(lecture_order, fill_value=0)
lecture_order = list(range(len(decoded_dict['Corrected_obesity_level_category'])))

n_types = len(lecture_order)
index = np.arange(n_types)
bar_width = 0.2

fig = plt.figure(figsize=(25, 22))
ax1 = fig.add_subplot(4,1,1)
ax2 = fig.add_subplot(4,1,2)
ax3 = fig.add_subplot(4,1,3)
ax4 = fig.add_subplot(4,1,4)

bar_family = ax1.bar(index, obesity_types_df_family, bar_width, color='darkorange', label='Family with obesity')
bar_no_family = ax1.bar(index + bar_width, obesity_types_df_no_family, bar_width, color='saddlebrown', label='Family without obesity')
autolabel(bar_family, ax1)
autolabel(bar_no_family, ax1)

ax1.set_title('Obesity types representation in our dataset depending on the history of obesity in the family')
ax1.set_xlabel('Obesity types')
ax1.set_ylabel('Number of people from that category')
ax1.set_xticks(index + bar_width / 2)
ax1.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax1.legend()

bar_hcf = ax2.bar(index, obesity_types_df_eat_hcf, bar_width, color='mediumspringgreen', label='High caloric food')
bar_no_hcf = ax2.bar(index + bar_width, obesity_types_df_no_eat_hcf, bar_width, color='lightseagreen', label='No high caloric food')
autolabel(bar_hcf, ax2)
autolabel(bar_no_hcf, ax2)

ax2.set_title('Obesity types representation in our dataset depending on if they eat high caloric food')
ax2.set_xlabel('Obesity types')
ax2.set_ylabel('Number of people from that category')
ax2.set_xticks(index + bar_width / 2)
ax2.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax2.legend()

bar_once = ax3.bar(index, obesity_types_df_eat_once_between_meals, bar_width, color='greenyellow', label='Once')
bar_twice = ax3.bar(index + bar_width, obesity_types_df_eat_twice_between_meals, bar_width, color='lime', label='Twice')
bar_thrice = ax3.bar(index + bar_width * 2, obesity_types_df_eat_thrice_between_meals, bar_width, color='forestgreen', label='Thrice')
bar_more = ax3.bar(index + bar_width * 3, obesity_types_df_eat_more_between_meals, bar_width, color='darkgreen', label='More')
autolabel(bar_once, ax3)
autolabel(bar_twice, ax3)
autolabel(bar_thrice, ax3)
autolabel(bar_more, ax3)

ax3.set_title('Obesity types representation in our dataset depending on the frequency of eating between meals')
ax3.set_xlabel('Obesity types')
ax3.set_ylabel('Number of people from that category')
ax3.set_xticks(index + bar_width / 4)
ax3.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax3.legend()

bar_never = ax4.bar(index, obesity_types_df_never_eat_vegetables, bar_width, color='royalblue', label='Always')
bar_often = ax4.bar(index + bar_width, obesity_types_df_often_eat_vegetables, bar_width, color='blueviolet', label='Often')
bar_always = ax4.bar(index + bar_width * 2, obesity_types_df_always_eat_vegetables, bar_width, color='darkmagenta', label='Never')
autolabel(bar_never, ax4)
autolabel(bar_often, ax4)
autolabel(bar_always, ax4)

ax4.set_title('Obesity types representation in our dataset depending on the frequency of eating vegetables')
ax4.set_xlabel('Obesity types')
ax4.set_ylabel('Number of people from that category')
ax4.set_xticks(index + bar_width / 3)
ax4.set_xticklabels(decoded_dict['Corrected_obesity_level_category'].values(), rotation=45)
ax4.legend()

fig.tight_layout()
plt.show()

There is an unsettling fact here visible in these graphs.

We can see that ALL of the 264 data categorized as Obesity_Type_III seem to always share the SAME value for multiple rows, the one exception probably being the single male belonging to that category, that is most likely a real human data.

This is troubling and concerning, because it means that there is absolutely no diversity in the latest obesity_types. Where this could be interpreted by the fact that maybe many values are similar because they, scientifically, hugely impact the obesity of an individual, this still seems unlikely to see this event happen in such a massive scale.

This is further proof that working with artificially created data by oversampling complicates our work here, as there are many biases that will affect our prediction model, and we can't really help but to acknowledge and allow them to exist.

In [28]:
plot = sns.pairplot(df,
             hue ='Corrected_obesity_level_category',
             vars = ['Gender','Age',
                     'Height', 'Weight'])

category_labels = {
    '0': 'Insufficient_Weight',
    '1': 'Normal_Weight',
    '2': 'Overweight',
    '3': 'Obesity_Type_I',
    '4': 'Obesity_Type_II',
    '5': 'Obesity_Type_III',
}

handles, legend_labels = plot._legend_data.values(), plot._legend_data.keys()
for category, label in zip(legend_labels, plot._legend.texts):
    if category in category_labels:
        label.set_text(category_labels[category])
plt.show()
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):

This is a representation of the repartition of our dataset regarding the four most basic features used to describe someone to get a global view of the population represented in our work here.

IV. Data distribution

Here, we'll see if some features aren't useable in terms of data distribution.

  • if the variability is too small
  • if scope is too small
  • the outliers
  • if the distribution is useable (globally)
For the global distribution
In [29]:
fig, ax = plt.subplots(1, figsize=(17, 20))
df.hist(ax=ax)
C:\Users\Louis\AppData\Local\Temp\ipykernel_19592\2122103899.py:2: UserWarning: To output multiple subplots, the figure containing the passed axes is being cleared.
  df.hist(ax=ax)
Out[29]:
array([[<Axes: title={'center': 'Gender'}>,
        <Axes: title={'center': 'Age'}>,
        <Axes: title={'center': 'Height'}>,
        <Axes: title={'center': 'Weight'}>,
        <Axes: title={'center': 'Family_history_with_overweight'}>],
       [<Axes: title={'center': 'Frequency_eat_high_caloric_food'}>,
        <Axes: title={'center': 'Frequency_eat_vegetables'}>,
        <Axes: title={'center': 'Number_of_main_meals'}>,
        <Axes: title={'center': 'Frequency_eat_between_meals'}>,
        <Axes: title={'center': 'Smoke'}>],
       [<Axes: title={'center': 'Frequency_water'}>,
        <Axes: title={'center': 'Monitoring_calories_consumption'}>,
        <Axes: title={'center': 'Frequency_physical_activity'}>,
        <Axes: title={'center': 'Time_using_technology_devices'}>,
        <Axes: title={'center': 'Frequency_alcohol'}>],
       [<Axes: title={'center': 'Obesity_level_category'}>,
        <Axes: title={'center': 'Main_transport_Automobile'}>,
        <Axes: title={'center': 'Main_transport_Bike'}>,
        <Axes: title={'center': 'Main_transport_Motorbike'}>,
        <Axes: title={'center': 'Main_transport_Public_Transportation'}>],
       [<Axes: title={'center': 'Main_transport_Walking'}>,
        <Axes: title={'center': 'MBI'}>,
        <Axes: title={'center': 'Corrected_obesity_level_category'}>,
        <Axes: >, <Axes: >]], dtype=object)

"Family history with overweight", "Frequency eat high caloric food", "number of main meals", "Smoke" and "monitoring calory consumption" have very concentrated values. So we will check if these values have a real impact on the target (obesity type). If not, these features won't be necessary (rare and less important events, these caracteritics don't make ML algorithms work efficiently).

In [30]:
fig, ax = plt.subplots(5,1, figsize=(10,17))
ax[0].scatter(df["Number_of_main_meals"], df["MBI"])
ax[0].set_title("MBI = f(Number_of_main_meals)")
ax[1].scatter(df["Monitoring_calories_consumption"], df["MBI"])
ax[1].set_title("MBI = f(Monitoring_calories_consumption)")
ax[2].scatter(df["Family_history_with_overweight"], df["MBI"])
ax[2].set_title("MBI = f(Family_history_with_overweight)")
ax[3].scatter(df["Frequency_eat_high_caloric_food"], df["MBI"])
ax[3].set_title("MBI = f(Frequency_eat_high_caloric_food)")
ax[4].scatter(df["Smoke"], df["MBI"])
ax[4].set_title("MBI = f(Smoke)")
Out[30]:
Text(0.5, 1.0, 'MBI = f(Smoke)')

In every graph (except monitoring calory consumption and family history with overweight) we don't see a huge impact of the feature on the result. Given that there are rare event, we won't keep this features.

Also on the histogram of distribution of frequency water (above) is continuous but there where only 4 different answers possible during the survey. We explain it with the oversampling action that they did. Below you'll see the proportion of normal values (discrete ones) and none normal.

However, these graphs do not represent the BEST way to show the correlation between the observed features and our target. We will try to find another way to show this later.

In [31]:
# Function to count decimal values in a column
def is_decimal(value):
    if value%1 != 0:
        return True
    else:
        return False

# Apply the function to the specified column
decimal_counts = df['Frequency_water'].apply(lambda x: is_decimal(x))
decimal_counts += df['Number_of_main_meals'].apply(lambda x: is_decimal(x))
decimal_counts += df['Frequency_eat_vegetables'].apply(lambda x: is_decimal(x))
decimal_counts += df['Frequency_physical_activity'].apply(lambda x: is_decimal(x))
decimal_counts += df['Time_using_technology_devices'].apply(lambda x: is_decimal(x))
decimal_counts = decimal_counts.value_counts()
percentage_decimal_water = (100*decimal_counts[0])/df.shape[0]

print(f"there is {decimal_counts[0]} decimal values so it represents {percentage_decimal_water:.2f}%")
print(decimal_counts)
there is 1580 decimal values so it represents 75.71%
Frequency_water
True     1580
False     507
Name: count, dtype: int64
C:\Users\Louis\AppData\Local\Temp\ipykernel_19592\394126203.py:15: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  percentage_decimal_water = (100*decimal_counts[0])/df.shape[0]
C:\Users\Louis\AppData\Local\Temp\ipykernel_19592\394126203.py:17: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  print(f"there is {decimal_counts[0]} decimal values so it represents {percentage_decimal_water:.2f}%")

So this is confirms the fact that 77% are artificial values (because 25% of values are discrete and 23% < 25%). In other words it is possible that some of the artificially generated rows got integer values in all of these observed columns.

Fortunately this shouldn't affect too badly our clustering results (shoud be the contrary in fact). We just have to keep in mind that our results will be kind of overfitting (because of the oversampling we overfit the sample of the study (which is small: 485)).

The outliers and the scope
In [32]:
df_scope_1 = df[['Monitoring_calories_consumption', 'Frequency_eat_vegetables',
              'Frequency_eat_between_meals', 'Frequency_water']]
df_scope_2 = df[['Frequency_physical_activity', 'Time_using_technology_devices',
              'Frequency_alcohol', 'Obesity_level_category']]
plt.figure(figsize=(15, 3))
df_scope_2.boxplot(column=['Frequency_physical_activity', 'Time_using_technology_devices',
              'Frequency_alcohol', 'Obesity_level_category'])
plt.show()
plt.figure(figsize=(15, 3))
df_scope_1.boxplot(column=['Monitoring_calories_consumption', 'Frequency_eat_vegetables',
              'Frequency_eat_between_meals', 'Frequency_water'])
plt.show()

fig = plt.figure(figsize=(15,3))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax1.boxplot(df["Age"])
ax2.boxplot(df["Height"])
ax1.grid()
ax2.grid()
ax1.set_xlabel('Age')
ax2.set_xlabel('Height')
plt.show()

There is not relevent outliers to drop (for example for the age it's important to represent the population, the only problem that is shown is that the sample isn't representative for the population)

The variability
In [33]:
df[["Age", "Height", "Frequency_physical_activity", "Time_using_technology_devices"]].describe()
Out[33]:
Age Height Frequency_physical_activity Time_using_technology_devices
count 2087.000000 2087.000000 2087.000000 2087.000000
mean 24.009104 1.702616 1.012812 0.663035
std 6.333297 0.093251 0.853475 0.608153
min 14.000000 1.450000 0.000000 0.000000
25% 19.000000 1.630000 0.124505 0.000000
50% 22.000000 1.700000 1.000000 0.630866
75% 26.000000 1.770000 1.678102 1.000000
max 61.000000 1.980000 3.000000 2.000000

We notice a small Standard Deviation for Height, it's because the data has not been normalized yet, but in the boxplots the distribution isn't that problematic. Globally, we don't have STD problems.

Data Distribution Conclusion

To sum up, we'll keep all of the features we have for the moment.

Before dropping anything, we will check the correlation for each of them to make sure we're making the right choice.

V. Correlation handling

In [34]:
corr = df.corr()
f, ax = plt.subplots(figsize=(16, 10))
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.color_palette("YlOrBr", as_cmap=True)
sns.heatmap(round(corr,2), mask=mask, cmap=cmap, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5},
            annot = True, annot_kws={"size": 8})
C:\Users\Louis\anaconda3\lib\site-packages\seaborn\matrix.py:260: FutureWarning: Format strings passed to MaskedConstant are ignored, but in future may error or produce different behavior
  annotation = ("{:" + self.fmt + "}").format(val)
Out[34]:
<Axes: >
In [35]:
highly_correlated_features = []
threshold = 0.7
for i in range(len(corr.columns)):
    for j in range(i):
        if abs(corr.iloc[i, j]) > threshold:
            colname = corr.columns[i]
            rowname = corr.index[j]
            highly_correlated_features.append((colname, rowname))
highly_correlated_features
Out[35]:
[('Main_transport_Public_Transportation', 'Main_transport_Automobile'),
 ('MBI', 'Weight'),
 ('Corrected_obesity_level_category', 'Weight'),
 ('Corrected_obesity_level_category', 'MBI')]

Because we're aiming to predict obesity level based on today's life criteria and without weight (and MBI), we can get rid of them for the prediction.
Also 'Main_transport_Public_Transportation' and 'Main_transport_Automobile' are highly correlated, let's take a deeper view into it.

In [36]:
transport_corr = df[["Main_transport_Public_Transportation", "Main_transport_Automobile"]]
transport_corr_repartition = transport_corr.groupby(["Main_transport_Public_Transportation", "Main_transport_Automobile"]).value_counts().reset_index()
def transport_car_or_bus(row):
    if row["Main_transport_Public_Transportation"] and row["Main_transport_Automobile"]:
        return "Automobile & Public transport"
    elif row["Main_transport_Public_Transportation"]:
        return "Public transport"
    elif row["Main_transport_Automobile"]:
        return "Automobile"
    return "Other"
transport_corr_repartition["transport"] = transport_corr_repartition.apply(transport_car_or_bus, axis=1)
transport_corr_repartition.rename(columns={0:"count"}, inplace=True)
colors = ['gold','cornsilk','goldenrod']
plt.pie(transport_corr_repartition["count"], labels=transport_corr_repartition["transport"], colors = colors)
plt.title('Pie chart of Automobile and Public transport repartition in Central America')
plt.tight_layout()
plt.show()

It seems that public transport and automobile are highly correlated (negatively). It is because they share a majority of exactly different values, since the "other" category represents a minority.

In [37]:
contingency_table = pd.crosstab(transport_corr_repartition['Main_transport_Automobile'], transport_corr_repartition['Main_transport_Public_Transportation'], values=transport_corr_repartition['count'], aggfunc='sum', margins=True, margins_name='Total')

plt.figure(figsize=(8, 6))
sns.heatmap(contingency_table, annot=True, cmap="Blues", fmt='g', cbar=True)
plt.title('Double Entry Table - Automobiles vs Public transport')
plt.show()

As we see there aren't a lot of 0-0 couples (False False square). And as we said above there are a lot of (0-1) and (1-0) because not many people answered with something else than Public transport or automobile.

In [38]:
plt.pie([49,28,23], labels =["Public transport", "Automobile", "Bike or walking"],
        colors = ['thistle','plum','mediumpurple'])
plt.title("Mexico City's urban areas transportation in 2014")
plt.tight_layout()
plt.show()

On the pie plot above, it also appears that a vast majority of central americans are moving by public transportation only.

This is suprising given that the asked question was : "Which transportation do you usually use". Here is a review of transportation in Mexico city (where public transportation is expected to be higher than in central america (our dataset)) (see)

We could argue that this is because in the survey they hade to choose one mean of transport, however the difference is too big to be explained by only this reason.

However we'll continue with this feature, knowing that our dataset is globally messed up (as we feared when describing the dataset).

Correlation Handling Conclusion

To sum up here are the features that we'll remove:

  • Weight
  • MBI
  • Obesity_type_level
  • Correct_obesity_type_level

(for obvious reasons)

  • Smoke (the data is not well distributed and not very correlated to our target)
  • Number_of_main_meals (same thing)
  • All of the Main_transport (difficult to work with because the question is too vague (can only answer one mean of transport))

VI. Prediction model

In [39]:
df.describe()
Out[39]:
Gender Age Height Weight Family_history_with_overweight Frequency_eat_high_caloric_food Frequency_eat_vegetables Number_of_main_meals Frequency_eat_between_meals Smoke ... Time_using_technology_devices Frequency_alcohol Obesity_level_category Main_transport_Automobile Main_transport_Bike Main_transport_Motorbike Main_transport_Public_Transportation Main_transport_Walking MBI Corrected_obesity_level_category
count 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 ... 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000 2087.000000
mean 0.504073 24.009104 1.702616 86.858649 0.825108 0.883565 2.421466 2.701179 1.853857 0.021083 ... 0.663035 2.270244 3.014375 0.218495 0.003354 0.005271 0.746526 0.026354 29.768852 2.484907
std 0.500103 6.333297 0.093251 26.190841 0.379966 0.320823 0.534737 0.764614 0.459494 0.143695 ... 0.608153 0.517008 1.948470 0.413324 0.057831 0.072426 0.435104 0.160223 8.029520 1.543128
min 0.000000 14.000000 1.450000 39.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 12.989340 0.000000
25% 0.000000 19.000000 1.630000 66.000000 1.000000 1.000000 2.000000 2.697467 2.000000 0.000000 ... 0.000000 2.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 24.368897 1.000000
50% 1.000000 22.000000 1.700000 83.100000 1.000000 1.000000 2.396265 3.000000 2.000000 0.000000 ... 0.630866 2.000000 3.000000 0.000000 0.000000 0.000000 1.000000 0.000000 28.864427 2.000000
75% 1.000000 26.000000 1.770000 108.000000 1.000000 1.000000 3.000000 3.000000 2.000000 0.000000 ... 1.000000 3.000000 5.000000 0.000000 0.000000 0.000000 1.000000 0.000000 36.068560 4.000000
max 1.000000 61.000000 1.980000 173.000000 1.000000 1.000000 3.000000 4.000000 3.000000 1.000000 ... 2.000000 3.000000 6.000000 1.000000 1.000000 1.000000 1.000000 1.000000 50.820275 5.000000

8 rows × 23 columns

In [40]:
sorted_corr = corr['Corrected_obesity_level_category'].abs().sort_values()
print(sorted_corr)
Smoke                                   0.005760
Main_transport_Bike                     0.028977
Number_of_main_meals                    0.029756
Main_transport_Motorbike                0.035747
Main_transport_Automobile               0.040675
Gender                                  0.052256
Main_transport_Public_Transportation    0.095328
Height                                  0.121205
Time_using_technology_devices           0.123959
Main_transport_Walking                  0.127328
Frequency_water                         0.142130
Frequency_alcohol                       0.180556
Monitoring_calories_consumption         0.183180
Frequency_physical_activity             0.188024
Age                                     0.237302
Frequency_eat_vegetables                0.249608
Frequency_eat_high_caloric_food         0.257411
Frequency_eat_between_meals             0.333918
Obesity_level_category                  0.428480
Family_history_with_overweight          0.482375
Weight                                  0.916672
MBI                                     0.982481
Corrected_obesity_level_category        1.000000
Name: Corrected_obesity_level_category, dtype: float64

Based on the correlation heatmap we've done before, we can directly see which variable have a bigger correlation with our "corrected_obesity_level_category".

However we will follow what we've seen in our previous analysis and drop some columns based on the work done in part III.

Since we want to predict the obesity type and we have access to this through the column "corrected_obesity_level_category", we're gonna do some supervised learning.

Let's make a quick pipeline to quickly see different algorithms first.

In [41]:
Y = df['Corrected_obesity_level_category']
X = df.drop(columns = ['Corrected_obesity_level_category', 'MBI', 'Obesity_level_category', 'Weight','Number_of_main_meals', 'Main_transport_Bike', 'Main_transport_Automobile', 'Main_transport_Motorbike','Main_transport_Public_Transportation','Main_transport_Walking', 'Smoke'])
In [42]:
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2,random_state=1, stratify = Y)

First we split the dataset into training and testing, and X (our variables) and Y (what we want to predict)

In [43]:
ss = StandardScaler()
x_train_scaled = ss.fit_transform(x_train)
x_test_scaled = ss.transform(x_test)

mm = MinMaxScaler()
x_train_mm_scaled = ss.fit_transform(x_train)
x_test_mm_scaled = ss.transform(x_test)

We're gonna test the algorithms for two different scalers.

In [44]:
pipeline = [
    ('Random Forest', RandomForestClassifier()),
    ('Decision Tree', DecisionTreeClassifier()),
    ('KNN', KNeighborsClassifier()),
    ('SVM', SVC())
]

We're testing for 4 different classification algorithms.

In [45]:
def model_predict(X_train: pd.DataFrame, y_train: pd.DataFrame,X_test: pd.DataFrame, y_test: pd.DataFrame):
  for name, model in pipeline:

          clf = model.fit(x_train, y_train)
          y_pred = clf.predict(x_test)

          clf_scaled = model.fit(x_train_scaled, y_train)
          y_pred_ss_scaled = clf_scaled.predict(x_test_scaled)

          clf_mm_scaled = model.fit(x_train_scaled, y_train)
          y_pred_mm_scaled = clf_scaled.predict(x_test_scaled)

          accuracy = round(accuracy_score(y_test, y_pred),5)
          scaled_ss_accuracy = round(accuracy_score(y_test, y_pred_ss_scaled),5)
          scaled_mm_accuracy = round(accuracy_score(y_test, y_pred_mm_scaled),5)


          print(name + ':')
          print("---------------------------------------------------------------")
          print("Accuracy:", accuracy)
          print("Accuracy w/Scaled_ss_accuracy Data (ss):", scaled_ss_accuracy)
          print("Accuracy w/scaled_mm_accuracy Data (mm):", scaled_mm_accuracy)
          if (accuracy > scaled_ss_accuracy) and (accuracy > scaled_mm_accuracy):
              print("\nClassification Report:\n", classification_report(y_test, y_pred))
              print("                            -----------------------------------               \n")
          elif (scaled_ss_accuracy > scaled_mm_accuracy):
              print("\nClassification Report (ss):\n", classification_report(y_test, y_pred_ss_scaled))
              print("                            -----------------------------------               \n")
          else:
              print("\nClassification Report (mm):\n", classification_report(y_test, y_pred_mm_scaled))
              print("                            -----------------------------------               \n")
In [46]:
model_predict(x_train,y_train,x_test,y_test)
Random Forest:
---------------------------------------------------------------
Accuracy: 0.84211
Accuracy w/Scaled_ss_accuracy Data (ss): 0.83254
Accuracy w/scaled_mm_accuracy Data (mm): 0.85407

Classification Report (mm):
               precision    recall  f1-score   support

           0       0.96      0.83      0.89        53
           1       0.67      0.69      0.68        59
           2       0.81      0.86      0.84       111
           3       0.93      0.86      0.90        74
           4       0.87      0.90      0.88        68
           5       0.94      0.96      0.95        53

    accuracy                           0.85       418
   macro avg       0.86      0.85      0.86       418
weighted avg       0.86      0.85      0.85       418

                            -----------------------------------               

Decision Tree:
---------------------------------------------------------------
Accuracy: 0.70813
Accuracy w/Scaled_ss_accuracy Data (ss): 0.71053
Accuracy w/scaled_mm_accuracy Data (mm): 0.71292

Classification Report (mm):
               precision    recall  f1-score   support

           0       0.67      0.77      0.72        53
           1       0.45      0.36      0.40        59
           2       0.71      0.65      0.68       111
           3       0.72      0.81      0.76        74
           4       0.75      0.78      0.76        68
           5       0.93      0.96      0.94        53

    accuracy                           0.71       418
   macro avg       0.70      0.72      0.71       418
weighted avg       0.70      0.71      0.71       418

                            -----------------------------------               

KNN:
---------------------------------------------------------------
Accuracy: 0.71292
Accuracy w/Scaled_ss_accuracy Data (ss): 0.74163
Accuracy w/scaled_mm_accuracy Data (mm): 0.74163

Classification Report (mm):
               precision    recall  f1-score   support

           0       0.66      0.85      0.74        53
           1       0.59      0.34      0.43        59
           2       0.72      0.73      0.73       111
           3       0.77      0.77      0.77        74
           4       0.79      0.87      0.83        68
           5       0.87      0.91      0.89        53

    accuracy                           0.74       418
   macro avg       0.73      0.74      0.73       418
weighted avg       0.73      0.74      0.73       418

                            -----------------------------------               

SVM:
---------------------------------------------------------------
Accuracy: 0.36603
Accuracy w/Scaled_ss_accuracy Data (ss): 0.7201
Accuracy w/scaled_mm_accuracy Data (mm): 0.7201

Classification Report (mm):
               precision    recall  f1-score   support

           0       0.75      0.68      0.71        53
           1       0.60      0.53      0.56        59
           2       0.72      0.69      0.71       111
           3       0.71      0.61      0.66        74
           4       0.71      0.91      0.80        68
           5       0.82      0.94      0.88        53

    accuracy                           0.72       418
   macro avg       0.72      0.73      0.72       418
weighted avg       0.72      0.72      0.71       418

                            -----------------------------------               

So as we can see, the random tree classifier seems to be the more accurate. Let's give it a closer look.

We can use a random forest classifier. It's basically creating lots of decision trees, and then use them all to decide. It's really effective for classification problems such as this one. Each decision trees will take into account a fixed amount of variable among every variables, to create more variation among the trees.

In [47]:
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2,random_state=2, stratify = Y)
scaler = MinMaxScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

Now we would like to know what parameters are optimal for our random forest classifier.

In [48]:
param_grid = {
    'n_estimators': [5, 10, 50, 100],
    'max_depth': [3, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
rf_classifier = RandomForestClassifier()
grid_search = GridSearchCV(rf_classifier, param_grid, cv=5, scoring='accuracy')
grid_search.fit(x_train_scaled, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

y_pred_best = best_model.predict(x_test_scaled)
accuracy_best = accuracy_score(y_test, y_pred_best)

print("Best Parameters:", best_params)
print("Best Accuracy on Test Set:", accuracy_best)
Best Parameters: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Best Accuracy on Test Set: 0.8253588516746412

GridSearchCV allows to try different combinations of parameters, and returns the most effective one. Here we can see that we obtain the best accuracy with every parameters default value as we can see here. image.png

Let's apply our algorithm then:

In [49]:
scores = []
for _ in range(25):
  rfc = RandomForestClassifier()
  rfc.fit(x_train_scaled, y_train)
  y_pred = rfc.predict(x_test_scaled)
  scores.append(rfc.score(x_test_scaled, y_test))
print(max(scores))
print(np.mean(scores))
print(min(scores))
0.84688995215311
0.8285167464114832
0.8133971291866029

We can make a cross validation : it is a way to train several times the algorithm on different parts of the dataset in order to reduce the risk of overfitting.

In [50]:
rfc = RandomForestClassifier()
scores = cross_val_score(rfc, x_train_scaled, y_train, cv=10)
rfc.fit(x_train_scaled, y_train)

print("Cross-validation scores:", scores)
print("Mean cross-validation score:", scores.mean())
Cross-validation scores: [0.83233533 0.80838323 0.83233533 0.84431138 0.79640719 0.81437126
 0.80239521 0.86826347 0.86227545 0.84337349]
Mean cross-validation score: 0.8304451338287281

We can see our random tree classifier seems to do well. We will try to use it later on on ourselves to see if the algorithm is working with some new data !

In [51]:
features = np.concatenate((x_train_scaled, x_test_scaled), axis=0)
importances = rfc.feature_importances_
std = np.std([tree.feature_importances_ for tree in rfc.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

cmap = plt.get_cmap('cool')
colors = cmap(np.linspace(0, 1, len(indices)))

plt.figure()
plt.title("Feature Importances")
plt.bar(range(features.shape[1]), importances[indices],
       color=colors, yerr=std[indices], align="center")
plt.xticks(range(features.shape[1]), pd.DataFrame(features).columns[indices], rotation='vertical')
plt.xlim([-1, features.shape[1]])
plt.show()
In [52]:
x_train.columns
Out[52]:
Index(['Gender', 'Age', 'Height', 'Family_history_with_overweight',
       'Frequency_eat_high_caloric_food', 'Frequency_eat_vegetables',
       'Frequency_eat_between_meals', 'Frequency_water',
       'Monitoring_calories_consumption', 'Frequency_physical_activity',
       'Time_using_technology_devices', 'Frequency_alcohol'],
      dtype='object')

This graph allows us to see which variable is the most impactful to predict the obesity type. 6 columns seem quite effective:

  1. Age
  2. Frequency_eat_vegetables
  3. Height
  4. Frequency_physical_activity
  5. Time_using_technology_devices
  6. Frequency_water
In [53]:
importances_sorted = np.sort(importances)[::-1]
data_importances = {
    'Feature': [df.columns[i] for i in indices],
    'Importance': [round(importances_sorted[i]*100, 2) for i in range(len(indices))]
}
importance_df = pd.DataFrame(data_importances['Importance'], index=data_importances['Feature'])
importance_df.columns = ['% of importance']
importance_df
Out[53]:
% of importance
Age 15.60
Frequency_eat_high_caloric_food 14.25
Height 14.08
Smoke 11.43
Frequency_water 10.95
Number_of_main_meals 10.57
Frequency_eat_vegetables 5.30
Gender 4.98
Weight 4.86
Monitoring_calories_consumption 4.79
Family_history_with_overweight 2.33
Frequency_eat_between_meals 0.86
In [70]:
rfc = RandomForestClassifier()
rfc.fit(x_train_scaled, y_train)
y_pred = rfc.predict(x_test_scaled)
print(rfc.score(x_test_scaled, y_test))

y_test_np = np.array(y_test)

# Print the results
print("Predicted Labels:")
print(y_pred)

print("\nTrue Labels:")
print(y_test_np)
0.8277511961722488
Predicted Labels:
[0 3 5 2 2 2 5 3 3 2 2 4 4 1 2 1 1 1 0 2 2 2 2 2 2 1 4 3 3 4 1 5 3 5 0 3 3
 3 0 1 0 2 0 3 2 5 4 2 2 0 0 3 2 2 3 2 0 5 1 3 1 0 0 3 0 5 2 3 2 2 4 2 4 3
 5 1 5 5 1 4 0 4 2 2 0 5 3 3 5 2 4 4 2 3 2 0 4 1 4 2 2 3 1 4 2 4 5 4 1 2 1
 3 4 3 2 0 1 4 5 5 4 3 4 3 5 3 0 1 2 3 4 0 0 4 5 4 4 0 5 0 0 3 1 2 2 2 0 2
 2 4 4 4 2 3 4 0 1 1 0 3 4 5 5 5 5 2 4 2 3 5 3 2 0 2 5 2 1 1 1 1 5 2 1 2 4
 2 1 2 2 3 4 0 4 5 0 1 3 4 3 4 4 5 1 1 5 0 2 5 3 4 2 2 4 3 0 2 4 0 2 4 3 2
 2 4 3 1 3 3 1 2 4 4 0 0 1 4 4 2 5 2 4 0 4 1 5 0 1 1 5 0 2 5 3 3 2 2 1 2 4
 2 0 0 2 2 5 0 2 0 1 1 2 1 4 1 2 4 1 2 4 3 1 2 0 1 2 2 2 2 4 3 4 0 2 2 1 2
 1 2 2 5 4 1 3 5 5 2 3 5 4 2 3 1 3 4 1 3 4 4 3 3 0 3 2 2 5 2 5 3 1 0 4 2 2
 1 2 1 3 2 3 5 5 1 3 5 3 2 2 3 1 5 4 2 0 4 3 3 4 0 1 3 2 3 3 5 5 0 2 4 4 4
 2 4 2 4 1 3 3 5 3 3 2 4 2 3 2 1 3 1 1 4 5 1 1 3 4 3 3 5 3 0 4 5 4 1 1 1 5
 2 1 0 2 0 2 0 0 4 5 1]

True Labels:
[0 3 5 2 2 2 5 3 3 1 2 4 4 2 2 1 0 2 0 2 2 2 2 2 2 2 4 3 3 4 2 5 3 5 0 2 3
 3 0 1 0 2 0 3 2 5 4 2 2 0 0 3 2 2 3 4 0 5 0 3 1 0 0 3 0 5 2 3 2 2 4 2 4 3
 5 1 5 5 1 4 0 4 2 1 0 5 3 3 4 2 4 3 2 1 2 0 4 1 4 2 2 3 1 4 2 4 5 4 1 2 1
 4 4 3 2 0 2 4 5 5 4 3 4 3 5 3 0 1 2 3 4 0 0 4 5 4 4 0 5 1 0 3 3 1 2 2 0 2
 2 4 4 4 2 3 4 1 1 3 0 3 4 5 5 5 5 3 4 1 3 5 3 2 0 2 5 0 1 1 2 1 5 2 2 2 4
 2 2 2 2 3 5 0 4 5 0 1 3 2 3 4 4 5 1 1 5 0 2 5 3 4 2 2 4 3 1 1 4 0 2 2 3 2
 2 0 2 2 3 3 4 2 4 4 0 0 1 4 4 1 5 2 4 0 4 1 5 0 2 2 5 1 2 5 3 3 2 2 0 1 4
 2 0 0 1 2 5 0 2 0 1 0 2 3 4 1 2 4 1 1 4 3 1 2 0 2 4 2 2 3 4 3 4 0 2 2 2 2
 1 2 2 5 4 1 3 5 5 2 3 5 4 3 3 2 3 4 2 3 4 2 3 3 0 3 1 2 5 1 5 3 1 0 4 2 3
 1 2 2 3 2 3 5 5 1 3 5 3 2 2 3 1 5 2 2 0 5 3 3 4 0 1 2 3 3 3 5 5 0 2 2 2 4
 2 4 2 4 1 3 3 5 3 1 2 4 1 3 2 1 3 2 1 4 5 1 1 4 4 3 3 5 2 0 4 5 4 1 1 1 5
 2 2 0 2 1 1 0 0 2 5 1]

Let's have a look on the confusion matrix:

In [71]:
cm = confusion_matrix(y_test_np, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=decoded_dict['Corrected_obesity_level_category'].values(), yticklabels=decoded_dict['Corrected_obesity_level_category'].values())
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')

plt.show()

Another default parameter we didn't consider here is Bootstrap = True. Bootstrapping resamples the original dataset with replacement many thousands of times to create simulated datasets. This process involves drawing random samples from the original dataset. Hence it is better to avoid overfitting, as every tree will be different.

In [72]:
average_df = df.drop(columns = ['Corrected_obesity_level_category', 'MBI', 'Obesity_level_category', 'Weight','Number_of_main_meals', 'Main_transport_Bike', 'Main_transport_Automobile', 'Main_transport_Motorbike','Main_transport_Public_Transportation','Main_transport_Walking', 'Smoke'])

average_people = {col:[] for col in average_df}
for col in average_df.columns:
  average_people[col].append(round(np.mean(df[df['Gender']==1][col])))
  average_people[col].append(round(np.mean(df[df['Gender']==0][col])))

average_people_df = pd.DataFrame(average_people)
average_people_df_scaled = scaler.transform(average_people_df)
average_people_prediction = rfc.predict(average_people_df_scaled)

genders = ['Average man', 'Average woman']
results = [decoded_dict['Corrected_obesity_level_category'][result] for result in average_people_prediction]
matrix_results = np.array([genders,results])
matrix_results
Out[72]:
array([['Average man', 'Average woman'],
       ['Normal_Weight', 'Obesity_Type_III']], dtype='<U16')

Therefore, exclusively from the data contained in this dataset, we can deduce that out of the people interrogated, the average man in all of the categories except MBI, Weight, and Obesity type, probably belongs in the Normal Weight type of obesity, whereas the average woman is predicted to belong in the Obesity_Type_III category (sometimes in the Obesity_Type_III but that depends on the random factor each time we run our model).

This analysis very clearly shows the mispresentation we showed earlier as categorizing the average woman in the Obseity_Type_III (the highest one) is completely unrealistic and unrepresentative of the global population. This is majorly explained by the fact that out of the 264 people categorized in that obesity type, 263 of them are women, which gives WAY too much importance to that feature, much more than it should up to this point.

However it is important to note that the results often change as we refit the algorithm, but this result is the one that occured the most after many tries.

Just out of curiosity, we will test the acccuracy of our algorithm on us, the members of this project :

In [73]:
for col in decoded_dict:
  print(f"{col} : {decoded_dict[col]}")
Gender : {0: 'Female', 1: 'Male'}
Family_history_with_overweight : {1: 'yes', 0: 'no'}
Frequency_eat_high_caloric_food : {0: 'no', 1: 'yes'}
Frequency_eat_between_meals : {2: 'Sometimes', 1: 'Frequently', 0: 'Always', 3: 'no'}
Smoke : {0: 'no', 1: 'yes'}
Monitoring_calories_consumption : {0: 'no', 1: 'yes'}
Frequency_alcohol : {3: 'no', 2: 'Sometimes', 1: 'Frequently', 0: 'Always'}
Obesity_level_category : {1: 'Normal_Weight', 5: 'Overweight_Level_I', 6: 'Overweight_Level_II', 2: 'Obesity_Type_I', 0: 'Insufficient_Weight', 3: 'Obesity_Type_II', 4: 'Obesity_Type_III'}
Age : Numeric values
Height : Numeric values
Weight : Numeric values
MBI : Numeric values
Frequency_eat_vegetables : {1: 'Never', 2: 'Sometimes', 3: 'Always'}
Number_of_main_meals : {1: '1/day', 2: '2/day', 3: '3/day', 4: '3+/day'}
Frequency_water : {1: 'Less than 1L/day', 2: 'Between 1 and 2L/day', 3: 'More than 2L/day'}
Frequency_physical_activity : {0: 'No activity', 1: '1 or 2 days/week', 2: '2 or 4 days/week', 3: '4+ days/week'}
Time_using_technology_devices : {0: '0-2 hours/day', 1: '3-5 hours/day', 2: '5+ hours/day'}
Main_transport_Automobile : {0: 'no', 1: 'yes'}
Main_transport_Bike : {0: 'no', 1: 'yes'}
Main_transport_Motorbike : {0: 'no', 1: 'yes'}
Main_transport_Public_Transportation : {0: 'no', 1: 'yes'}
Main_transport_Walking : {0: 'no', 1: 'yes'}
Corrected_obesity_level_category : {0: 'Insufficient_Weight', 1: 'Normal_Weight', 2: 'Overweight', 3: 'Obesity_Type_I', 4: 'Obesity_Type_II', 5: 'Obesity_Type_III'}

This is simply to help us know which value we should assign to ourselves.

In [74]:
group_members = {'Gender': [1,1,1], 'Age': [21,20,21], 'Height': [1.80,1.76,1.74], 'Family_history_with_overweight': [1,0,0],
                 'Frequency_eat_high_caloric_food': [1,1,1], 'Frequency_eat_vegetables': [2,2,2], 'Frequency_eat_between_meals': [1,1,2],
                 'Frequency_water': [2,2,2], 'Monitoring_calories_consumption': [0,0,1], 'Frequency_physical_activity': [2,1,1],
                 'Time_using_technology_devices': [2,2,2], 'Frequency_alcohol': [3,2,2]}

group_members_df = pd.DataFrame(group_members)
group_members_df_scaled = scaler.transform(group_members_df)
group_members_prediction = rfc.predict(group_members_df_scaled)

members = ['Louis', 'Killian', 'Marc']
predicted_results = [decoded_dict['Corrected_obesity_level_category'][result] for result in group_members_prediction]
matrix_predicted_results = np.array([members,predicted_results])
print(matrix_predicted_results)
[['Louis' 'Killian' 'Marc']
 ['Insufficient_Weight' 'Normal_Weight' 'Normal_Weight']]
In [75]:
members_weight = [67.0, 59.0, 65.0]
real_MBI = [members_weight[i]/group_members['Height'][i]**2 for i in range(len(members))]
real_results = [decoded_dict['Corrected_obesity_level_category'][GetCategory(mbi)] for mbi in real_MBI]

matrix_real_results = np.array([members,real_MBI,real_results])
print(matrix_real_results)
[['Louis' 'Killian' 'Marc']
 ['20.679012345679013' '19.047004132231404' '21.469150482230148']
 ['Normal_Weight' 'Normal_Weight' 'Normal_Weight']]

VII. Conclusion

Therefore, we now have a model capable of predicting with a relatively precise accuracy the obesity type of a person depending on all of the features presented in this dataset (except for the Weight column which gives too much information).

Based on our tests and analyses from earlier, we can conclude that our algorithm must have a pretty good basis since all of the predictions applied to our group ended up being correct.

However, we know that it is still very open to errors, as there are many columns that will completely bias our model's predictions, more so than what the features should impact realistically, as as huge part of our data has been artificially generated by oversampling, which leads to an overfitting of certain features.

For example, there are precisely 264 rows that belong to the Obesity_Type_III, which is the most dangerous type in terms of health, but they are extremely poorly distributed.

Indeed, out of these 264 rows, only one of them is a male. Even though this could theorically be plausible, it seems unlikely that out of the group of people interrogated there was such a discriminated representation of that obesity type category. And this is only for the Gender column, there are many other features that have the same problem applied for them.

In the end, oversampling the dataset to get a larger and broader view over the repartition of people belonging to each category was an interesting idea, but was executed poorly, and gave too much of a bias to the dataset.


That being said, with our model now complete and functional, we could apply it to any person given they fill all of their personal data for each of the features required, but we could not find another dataset with these exact same columns, so we only applied it to the members of our group to prove its efficiency.

After multiple tries, we found out that our model always predicts the right category for Killian, but for Louis it seem to switch from Insufficient_Weight to Normal_Weight each time we generate a new model, and for Marc it switches from Normal_Weight to Overweight, so the little part of uncertainty is still clearly visible.

However, our group lacks diversity since all three of us belong in the Normal_Weight category, but we also have similar eating habits and physical conditions. What would have been interesting was to test our algorithm on people with very different habits than us to see how our model would perform with a wider range of subjects.

If you now wish to test our model personaly or with fictional values, you can test our API application and fill the form with all of the data, and you'll find out what obesity type our model predicts for you.

Thank you for your attention throughout this notebook, we hope that you enjoyed reading our reports and analyses !

duck.gif